STAT 679: Problem Set #3
Q1. Glacial Lakes
The data at this link contain labels of glacial lakes in the Hindu Kush Himalaya, created during an ecological survey in 2015 by the International Centre for Integrated Mountain Development.
Part (a)
a.1 How many lakes are in this dataset?
There are 3624 lakes in this dataset.
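This count can be checked directly once the file is loaded (a sketch; the file path is a hypothetical placeholder, not from the assignment):

```r
library(sf)

# Hypothetical path -- the survey file from the link above
lakes <- read_sf("../data/glacial_lakes.geojson")

# Each feature is one lake polygon, so the row count is the lake count
nrow(lakes)  # 3624 for this survey
```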
a.2 What are the latitude / longitude coordinates of the largest lakes in each Sub-basin?
```r
largest_lakes <- lakes %>%
  group_by(Sub_Basin) %>%
  filter(Area == max(Area)) %>%
  select(GL_ID, Sub_Basin, Area, Latitude, Longitude)
```

| GL_ID | Sub_Basin | Area | Latitude | Longitude | geometry |
|---|---|---|---|---|---|
|---|---|---|---|---|---|
| GL086304E28374N | Arun | 4.0192059 | 28.37403 | 86.30475 | POLYGON ((86.29541 28.35189… |
| GL085838E28322N | Sun Koshi | 5.4113912 | 28.32223 | 85.83813 | POLYGON ((85.84835 28.32543… |
| GL086925E27898N | Dudh Koshi | 1.3428310 | 27.89853 | 86.92510 | POLYGON ((86.9127 27.89883,… |
| GL087866E27869N | Tamor | 0.6961043 | 27.86953 | 87.86618 | POLYGON ((87.85986 27.86149… |
| GL086447E27946N | Tama Koshi | 1.7197430 | 27.94679 | 86.44713 | POLYGON ((86.43581 27.93609… |
| GL085717E28042N | Indrawati | 0.0269401 | 28.04209 | 85.71757 | POLYGON ((85.71581 28.04205… |
| GL086542E27713N | Likhu | 0.0903048 | 27.71321 | 86.54286 | POLYGON ((86.54486 27.71284… |
| GL082948E29196N | Bheri | 4.9235690 | 29.19634 | 82.94822 | POLYGON ((82.94605 29.21763… |
| GL082423E29384N | Tila | 0.4322915 | 29.38408 | 82.42375 | POLYGON ((82.42288 29.38101… |
| GL081554E29648N | Karnali | 0.1262245 | 29.64815 | 81.55488 | POLYGON ((81.55656 29.64652… |
| GL082414E29753N | Mugu | 0.4283468 | 29.75376 | 82.41434 | POLYGON ((82.41277 29.75636… |
| GL081526E29772N | West Seti | 0.5075936 | 29.77282 | 81.52618 | POLYGON ((81.52834 29.77348… |
| GL081577E29897N | Kawari | 0.2839747 | 29.89755 | 81.57738 | POLYGON ((81.57937 29.89932… |
| GL081780E30128N | Humla | 0.7565747 | 30.12892 | 81.78076 | POLYGON ((81.77613 30.13338… |
| GL080178E30564N | Kali | 0.2396768 | 30.56462 | 80.17877 | POLYGON ((80.18417 30.56559… |
| GL084116E28446N | Seti | 0.1024868 | 28.44681 | 84.11694 | POLYGON ((84.11637 28.44483… |
| GL085519E28467N | Trishuli | 0.4539165 | 28.46783 | 85.51938 | POLYGON ((85.51812 28.47, 8… |
| GL084628E28596N | Budhi Gandaki | 0.2674945 | 28.59618 | 84.62826 | POLYGON ((84.63338 28.59795… |
| GL083851E28690N | Marsyangdi | 3.3841276 | 28.69089 | 83.85184 | POLYGON ((83.84369 28.70758… |
| GL083701E29218N | Kali Gandaki | 0.4344556 | 29.21850 | 83.70156 | POLYGON ((83.7026 29.21713,… |
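A tie-safe alternative to the `filter(Area == max(Area))` step is `slice_max()`, which returns exactly one row per sub-basin even if two lakes share the maximum area (a sketch, assuming the same `lakes` object as above):

```r
library(dplyr)

largest_lakes <- lakes %>%
  group_by(Sub_Basin) %>%
  # with_ties = FALSE keeps a single lake per group even on exact ties
  slice_max(Area, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(GL_ID, Sub_Basin, Area, Latitude, Longitude)
```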
Part (b)
Plot the polygons associated with each of the lakes identified in part (a).
Hint: You may find it useful to split lakes across panels using the `tm_facets` function. If you use this approach, make sure to include a scale with `tm_scale_bar()`, so that it is clear that the lakes have different sizes.
```r
tm_shape(largest_lakes) +
  tm_polygons(col = "Area", palette = "Blues", legend.show = FALSE) +
  tm_facets(by = "Sub_Basin", free.scales = FALSE, ncol = 5, scale.factor = 4) +
  tm_scale_bar(width = 0.5) +
  tm_layout(
    main.title = "Largest lakes in each Sub-basin",
    main.title.position = c("center", "top"),
    main.title.fontfamily = plot_font,
    panel.label.bg.color = "darkgoldenrod4",
    panel.label.fontface = "bold",
    panel.label.color = page_bg_color,
    panel.label.fontfamily = plot_font,
    bg.color = "darkseagreen",
    outer.bg.color = page_bg_color
  )
```

Part (c)
Visualize all lakes with latitude between 28.2 and 28.4 and with longitude between 85.8 and 86. Optionally, add a basemap associated with each lake.
```r
lakes_subset <- lakes %>%
  filter(Latitude >= 28.2, Latitude <= 28.4, Longitude >= 85.8, Longitude <= 86.0)

# Download a basemap raster centered on the subset
basemap <- cc_location(loc = c(85.9, 28.3), buffer = 1e4)

tm_shape(basemap) +
  tm_rgb(alpha = 0.9) +
  tm_shape(lakes_subset) +
  tm_polygons(col = "deepskyblue3") +
  tm_layout(bg.color = page_bg_color, inner.margins = c(0, 0), frame = FALSE, saturation = -1)
```

Q2. Australian Pharmaceuticals II
The PBS data set contains the number of orders filed every month for different classes of pharmaceutical drugs, as tracked by the Australian Pharmaceutical Benefits Scheme. The code below takes the full PBS data set and filters down to the 10 most commonly prescribed pharmaceutical types. This problem will ask you to implement and compare two approaches to visualizing this data set.
```r
pbs_full <- read_csv("../data/PBS_random.csv") %>%
  mutate(Month = as.Date(Month))

top_atcs <- pbs_full %>%
  group_by(ATC2_desc) %>%
  summarise(total = sum(Scripts)) %>%
  slice_max(total, n = 10) %>%
  pull(ATC2_desc)

pbs <- pbs_full %>%
  filter(ATC2_desc %in% top_atcs, Month > "2007-01-01")
```

Part (a)
Implement a stacked area visualization of these data.
```r
ggplot(pbs) +
  geom_area(aes(Month, Scripts / 1e6, col = ATC2_desc, fill = ATC2_desc), alpha = 0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks = 11) +
  scale_fill_brewer(palette = "Paired") +
  scale_colour_brewer(palette = "Paired") +
  labs(
    title = "Sales of the 10 most commonly prescribed drug classes since 2007",
    y = "Orders filled (in millions)", fill = "Class of drug", col = "Class of drug"
  )
```

Part (b)
Implement an alluvial visualization of these data.
```r
# requires the ggalluvial package
ggplot(pbs) +
  geom_alluvium(
    aes(Month, Scripts / 1e6, fill = ATC2_desc, col = ATC2_desc, alluvium = ATC2_desc),
    decreasing = FALSE, alpha = 0.6
  ) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks = 11) +
  scale_fill_brewer(palette = "Paired") +
  scale_colour_brewer(palette = "Paired") +
  labs(
    title = "Sales of the 10 most commonly prescribed drug classes since 2007",
    y = "Orders filled (in millions)", fill = "Class of drug", col = "Class of drug"
  )
```

Part (c)
Compare and contrast the strengths and weaknesses of these two visualization strategies. Which user queries are easier to answer using one approach vs. the other?
| Visualization Strategy | Strengths | Weaknesses |
|---|---|---|
| Stacked Area | Stacking the areas makes it easy to see how the total at any time point breaks down by class. Queries about which drugs had the highest sales at a given time, and about the overall trend, are easy to answer. | Hard to rank the drug classes by their relative sales. |
| Alluvial | Ranking is easy to read: at any time point, the streams are ordered by each class's contribution to sales. Queries about which drug has the highest or lowest sales, or how a class's sales change across time points, are easy to answer. | Hard to gauge total sales at a glance, since the display is geared toward comparison. |
Q3. Spotify Time Series II
The code below provides the number of Spotify streams for the 40 tracks with the highest stream count in the Spotify 100 dataset for 2017. This problem will ask you to explore a few different strategies that can be used to visualize this time series collection.
```r
spotify_full <- read_csv("../data/spotify.csv")

top_songs <- spotify_full %>%
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 40) %>%
  pull(track_name)

top_10_songs <- spotify_full %>%
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 10) %>%
  pull(track_name)

spotify <- spotify_full %>% filter(track_name %in% top_songs)

# Preprocessing: shorten one long track name, then strip parenthesized
# suffixes such as "(feat. ...)" from the remaining names
spotify$track_name[spotify$track_name == unique(spotify$track_name)[14]] <- "I Don’t Wanna Live Forever"
spotify$track_name <- trimws(str_replace(spotify$track_name, " \\(([^)]+)\\)", ""))
```

Part (a)
Design and implement a line-based visualization of these data.
```r
# TODO: Save plots as images and load them in the Knit output
ggplot(spotify) +
  geom_line(aes(date, streams / 1e6, group = track_name),
            col = "forestgreen", alpha = 0.5, linewidth = 0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 12, 1)) +
  labs(title = "Spotify's 40 top streamed songs in 2017", x = "", y = "Streams (in millions)") +
  theme(
    axis.ticks.x = element_line(),
    axis.text.x = element_text(size = 7, hjust = 0.7),
    panel.grid.minor = element_blank()
  )
```

Part (b)
Design and implement a horizon plot visualization of these data.
```r
# Two ramps spliced into a single 12-color horizon palette:
# greys below the origin, greens above it
custom_greys_palette <- colorRampPalette(c("black", "#EAEAEA"), space = "Lab")
custom_greens_palette <- colorRampPalette(c("#BAE4B3", "forestgreen"), space = "Lab")
spotify_palette <- c(custom_greys_palette(8), custom_greens_palette(4))

# TODO: Save plots as images and load them in the Knit output
ggplot(spotify, aes(date, streams / 1e6)) +
  geom_horizon(aes(fill = ..Cutpoints..), origin = 4, horizonscale = seq(0, 12, 1), alpha = 0.8) +
  facet_wrap(~ reorder(track_name, -streams), ncol = 1, strip.position = "left") +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 12, 3)) +
  scale_fill_manual(values = spotify_palette) +
  labs(title = "Spotify's top streamed songs in 2017", x = "", y = "", fill = "Streams (in millions)") +
  guides(fill = guide_legend(nrow = 1, reverse = TRUE)) +
  theme(
    axis.ticks.x = element_line(),
    axis.text.x = element_text(hjust = 0.9),
    axis.text.y = element_blank(),
    strip.text.y.left = element_text(angle = 0, hjust = 0.5),
    legend.position = "top"
  )
```

Part (c)
Building from the static views from (a - b), propose, but do not implement, a visualization of this data set that makes use of dynamic queries. What would be the structure of interaction, and how would the display update when the user provides a cue? Explain how you would use one type of D3 selection or mouse event to implement this hypothetical interactivity.
Response:
I have two ideas for dynamic queries for these temporal visualizations.
- Adding a brush interaction inside a facet. When the user brushes over a region inside one facet, that time period would be highlighted (and perhaps zoomed) across all of the song facets, supporting fine-grained inspection of stream counts over a short window. I would also add a "Reset Zoom" button to restore the original view. In D3, this could be implemented with a `d3.brushX()` selection attached to each facet's `<g>` element, listening for the brush's `end` event to read the selected date range and update the x-scale of every panel. A non-graphical alternative to this query is to replace the brush with a date-range slider.
- A multi-select tool to filter artists or tracks from the dataset. Facets would appear or disappear based on the chosen filter.
These interactive queries could also be combined to produce a highly responsive, information-dense visualization.
Q4. CalFresh Enrollment II
In this problem, we will develop an interactively linked spatial and temporal visualization of enrollment data from CalFresh, a nutritional assistance program in California. We will use D3 in this problem.
Part (a)
Using a line generator, create a static line plot that visualizes change in enrollment for every county in California.